AITopics | Hakha

Collaborating Authors

Hakha

A Appendix

Neural Information Processing SystemsFeb-17-2026, 07:56:21 GMT

The complete list may be seen in Table 8. Here are a few general notes about these strings: 1. Based on their recommendations, we did the following: 1. zh, zh_Latn: This resulted in the special filters described below. URLs) the corpora were in languages different from the LangID predictions. This is mainly mis-rendered PDFs and may have practical applications for denoising, or for decoding such garbled PDFs.

latn, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

Oceania > Tonga (0.04)
North America > United States (0.04)
South America > Peru > Huánuco Department > Huánuco Province > Huánuco (0.04)
(24 more...)

Industry: Law (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Communications > Social Media (0.67)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)

Add feedback

A Appendix A.1 LangID Details

Neural Information Processing SystemsOct-9-2025, 08:30:30 GMT

latn, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

Oceania > Tonga (0.04)
North America > United States (0.04)
South America > Peru > Huánuco Department > Huánuco Province > Huánuco (0.04)
(24 more...)

Industry: Law (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Communications > Social Media (0.67)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.46)

Add feedback

Do Not Trust Licenses You See: Dataset Compliance Requires Massive-Scale AI-Powered Lifecycle Tracing

Kim, Jaekyeom, Sohn, Sungryull, Jo, Gerrard Jeongwon, Choi, Jihoon, Bae, Kyunghoon, Lee, Hwayoung, Park, Yongmin, Lee, Honglak

arXiv.org Artificial IntelligenceMar-14-2025

This paper argues that a dataset's legal risk cannot be accurately assessed by its license terms alone; instead, tracking dataset redistribution and its full lifecycle is essential. However, this process is too complex for legal experts to handle manually at scale. Tracking dataset provenance, verifying redistribution rights, and assessing evolving legal risks across multiple stages require a level of precision and efficiency that exceeds human capabilities. Addressing this challenge effectively demands AI agents that can systematically trace dataset redistribution, analyze compliance, and identify legal risks. We develop an automated data compliance system called NEXUS and show that AI can perform these tasks with higher accuracy, efficiency, and cost-effectiveness than human experts. Our massive legal analysis of 17,429 unique entities and 8,072 license terms using this approach reveals the discrepancies in legal rights between the original datasets before redistribution and their redistributed subsets, underscoring the necessity of the data lifecycle-aware compliance. For instance, we find that out of 2,852 datasets with commercially viable individual license terms, only 605 (21%) are legally permissible for commercialization. This work sets a new standard for AI data governance, advocating for a framework that systematically examines the entire lifecycle of dataset redistribution to ensure transparent, legal, and responsible dataset management.

dataset, dependency, trust license, (12 more...)

arXiv.org Artificial Intelligence

2503.02784

Country:

Asia > Japan (0.14)
North America > United States > Wisconsin (0.04)
North America > United States > New York (0.04)
(26 more...)

Genre: Research Report (0.84)

Industry:

Law > Intellectual Property & Technology Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area (1.00)
Government > Regional Government > North America Government > United States Government (0.67)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
(5 more...)

Add feedback

Whisper Turns Stronger: Augmenting Wav2Vec 2.0 for Superior ASR in Low-Resource Languages

Anidjar, Or Haim, Marbel, Revital, Yozevitch, Roi

arXiv.org Artificial IntelligenceDec-31-2024

Approaching Speech-to-Text and Automatic Speech Recognition problems in low-resource languages is notoriously challenging due to the scarcity of validated datasets and the diversity of dialects. Arabic, Russian, and Portuguese exemplify these difficulties, being low-resource languages due to the many dialects of these languages across different continents worldwide. Moreover, the variety of accents and pronunciations of such languages complicate ASR models' success. With the increasing popularity of Deep Learning and Transformers, acoustic models like the renowned Wav2Vec2 have achieved superior performance in the Speech Recognition field compared to state-of-the-art approaches. However, despite Wav2Vec2's improved efficiency over traditional methods, its performance significantly declines for under-represented languages, even though it requires significantly less labeled data. This paper introduces an end-to-end framework that enhances ASR systems fine-tuned on Wav2Vec2 through data augmentation techniques. To validate our framework's effectiveness, we conducted a detailed experimental evaluation using three datasets from Mozilla's Common Voice project in Arabic, Russian, and Portuguese. Additionally, the framework presented in this paper demonstrates robustness to different diacritics. Ultimately, our approach outperforms two previous baseline models, which are the pre-trained Wav2Vec2 and the well-known Whisper ASR model, resulting in an average relative improvement of 33.9\% in Word Error Rate and a 53.2\% relative improvement in Character Error Rate.

artificial intelligence, deep learning, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2501.00425

Country:

Asia > Middle East > Israel (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > New York (0.04)
(4 more...)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Goldfish: Monolingual Language Models for 350 Languages

Chang, Tyler A., Arnett, Catherine, Tu, Zhuowen, Bergen, Benjamin K.

arXiv.org Artificial IntelligenceAug-19-2024

For many low-resource languages, the only available language models are large multilingual models trained on many languages simultaneously. However, using FLORES perplexity as a metric, we find that these models perform worse than bigrams for many languages (e.g. 24% of languages in XGLM 4.5B; 43% in BLOOM 7.1B). To facilitate research that focuses on low-resource languages, we pre-train and release Goldfish, a suite of monolingual autoregressive Transformer language models up to 125M parameters for 350 languages. The Goldfish reach lower FLORES perplexities than BLOOM, XGLM, and MaLA-500 on 98 of 204 FLORES languages, despite each Goldfish model being over 10x smaller. However, the Goldfish significantly underperform larger multilingual models on reasoning benchmarks, suggesting that for low-resource languages, multilinguality primarily improves general reasoning abilities rather than basic text generation. We release models trained on 5MB (350 languages), 10MB (288 languages), 100MB (166 languages), and 1GB (83 languages) of text data where available. The Goldfish models are available as baselines, fine-tuning sources, or augmentations to existing models in low-resource NLP research, and they are further useful for crosslinguistic studies requiring maximally comparable models across languages.

computational linguistic, dataset, latn 1, (13 more...)

arXiv.org Artificial Intelligence

2408.10441

Country:

Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
Europe > Russia (0.04)
Europe > Germany > Saxony > Leipzig (0.04)
(34 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)

Add feedback

Towards physics-informed neural networks for landslide prediction

Dahal, Ashok, Lombardo, Luigi

arXiv.org Artificial IntelligenceJul-9-2024

For decades, solutions to regional scale landslide prediction have mostly relied on data-driven models, by definition, disconnected from the physics of the failure mechanism. The success and spread of such tools came from the ability to exploit proxy variables rather than explicit geotechnical ones, as the latter are prohibitive to acquire over broad landscapes. Our work implements a Physics Informed Neural Network (PINN) approach, thereby adding to a standard data-driven architecture, an intermediate constraint to solve for the permanent deformation typical of Newmark slope stability methods. This translates into a neural network tasked with explicitly retrieving geotechnical parameters from common proxy variables and then minimize a loss function with respect to the available coseismic landside inventory. The results are very promising, because our model not only produces excellent predictive performance in the form of standard susceptibility output, but in the process, also generates maps of the expected geotechnical properties at a regional scale. Such architecture is therefore framed to tackle coseismic landslide prediction, something that, if confirmed in other studies, could open up towards PINN-based near-real-time predictions.

landslide, lombardo, neural network, (14 more...)

arXiv.org Artificial Intelligence

2407.06785

Country:

Asia > Middle East > Republic of Türkiye (0.14)
Asia > Nepal (0.05)
South America > Venezuela (0.04)
(9 more...)

Genre: Research Report > New Finding (0.67)

Industry: Energy (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

Lovenia, Holy, Mahendra, Rahmad, Akbar, Salsabil Maulana, Miranda, Lester James V., Santoso, Jennifer, Aco, Elyanah, Fadhilah, Akhdan, Mansurov, Jonibek, Imperial, Joseph Marvin, Kampman, Onno P., Moniz, Joel Ruben Antony, Habibi, Muhammad Ravi Shulthan, Hudi, Frederikus, Montalan, Railey, Ignatius, Ryan, Lopo, Joanito Agili, Nixon, William, Karlsson, Börje F., Jaya, James, Diandaru, Ryandito, Gao, Yuze, Amadeus, Patrick, Wang, Bin, Cruz, Jan Christian Blaise, Whitehouse, Chenxi, Parmonangan, Ivan Halim, Khelli, Maria, Zhang, Wenyu, Susanto, Lucky, Ryanda, Reynard Adha, Hermawan, Sonny Lazuardi, Velasco, Dan John, Kautsar, Muhammad Dehan Al, Hendria, Willy Fitra, Moslem, Yasmin, Flynn, Noah, Adilazuarda, Muhammad Farid, Li, Haochen, Lee, Johanes, Damanhuri, R., Sun, Shuo, Qorib, Muhammad Reza, Djanibekov, Amirbek, Leong, Wei Qi, Do, Quyet V., Muennighoff, Niklas, Pansuwan, Tanrada, Putra, Ilham Firdausi, Xu, Yan, Tai, Ngee Chia, Purwarianti, Ayu, Ruder, Sebastian, Tjhi, William, Limkonchotiwat, Peerat, Aji, Alham Fikri, Keh, Sedrick, Winata, Genta Indra, Zhang, Ruochen, Koto, Fajri, Yong, Zheng-Xin, Cahyawijaya, Samuel

arXiv.org Artificial IntelligenceJul-8-2024

Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.

computational linguistic, dataset, sea language, (14 more...)

arXiv.org Artificial Intelligence

2406.10118

Country:

Asia > Southeast Asia (0.24)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Asia > Laos (0.06)
(59 more...)

Genre: Research Report (0.81)

Industry:

Education (0.68)
Information Technology (0.67)
Energy (0.45)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(6 more...)

Add feedback

mHuBERT-147: A Compact Multilingual HuBERT Model

Boito, Marcely Zanon, Iyer, Vivek, Lagos, Nikolaos, Besacier, Laurent, Calapodescu, Ioan

arXiv.org Artificial IntelligenceJun-27-2024

We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment than the original method. We also apply a new multilingual batching up-sampling strategy, leveraging both language and dataset diversity. After 3 training iterations, our compact 95M parameter mHuBERT-147 outperforms larger models trained on substantially more data. We rank second and first on the ML-SUPERB 10min and 1h leaderboards, with SOTA scores for 3 tasks. Across ASR/LID tasks, our model consistently surpasses XLS-R (300M params; 436K hours) and demonstrates strong competitiveness against the much larger MMS (1B params; 491K hours). Our findings indicate that mHuBERT-147 is a promising model for multilingual speech tasks, offering an unprecedented balance between high performance and parameter efficiency.

dataset, iteration, representation, (14 more...)

arXiv.org Artificial Intelligence

2406.06371

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.04)
(15 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
(2 more...)

Add feedback

Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training

Denisov, Pavel, Vu, Ngoc Thang

arXiv.org Artificial IntelligenceApr-16-2024

Recent advancements in language modeling have led to the emergence of Large Language Models (LLMs) capable of various natural language processing tasks. Despite their success in text-based tasks, applying LLMs to the speech domain remains limited and challenging. This paper presents BLOOMZMMS, a novel model that integrates a multilingual LLM with a multilingual speech encoder, aiming to harness the capabilities of LLMs for speech recognition and beyond. Utilizing a multi-instructional training approach, we demonstrate the transferability of linguistic knowledge from the text to the speech modality. Our experiments, conducted on 1900 hours of transcribed data from 139 languages, establish that a multilingual speech representation can be effectively learned and aligned with a multilingual LLM. While this learned representation initially shows limitations in task generalization, we address this issue by generating synthetic targets in a multi-instructional style. Our zero-shot evaluation results confirm the robustness of our approach across multiple tasks, including speech translation and multilingual spoken language understanding, thereby opening new avenues for applying LLMs in the speech domain.

dataset, evaluation, llm, (13 more...)

arXiv.org Artificial Intelligence

2404.10922

Country:

Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > Pennsylvania (0.04)
(5 more...)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

GlotLID: Language Identification for Low-Resource Languages

Kargaran, Amir Hossein, Imani, Ayyoob, Yvon, François, Schütze, Hinrich

arXiv.org Artificial IntelligenceNov-4-2023

Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable and (iii) efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguage vs varieties and in general noisy data. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures. GlotLID-M model, code, and list of data sources are available: https://github.com/cisnlp/GlotLID.

language identification, natural language processing, resource and evaluation conference, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.18653/v1/2023.findings-emnlp.410

2310.16248

Country:

Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
South America > Peru > Huánuco Department > Huánuco Province > Huánuco (0.04)
North America > Mexico > Puebla (0.04)
(84 more...)

Genre: Research Report > New Finding (0.87)

Industry:

Media > Television (0.45)
Health & Medicine > Therapeutic Area > Neurology (0.33)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Add feedback